Robust parameters for noisy speech recognition

ABSTRACT

The present invention relates to a method of automatic processing of noise-affected speech comprising at least the following steps:  
     capture and digitising of the speech in the form of at least one digitised signal ( 1 ),  
     extraction of several time-based sequences or frames ( 15 ), corresponding to said signal, by means of an extraction system ( 10 ),  
     decomposition of each frame ( 15 ) by means of an analysis system ( 20, 40 ) into at least two different frequency bands so as to obtain at least two first vectors of representative parameters ( 45 ) for each frame ( 15 ), one for each frequency band, and  
     conversion, by means of converter systems ( 50 ), of the first vectors of representative parameters ( 45 ) into second vectors of parameters relatively insensitive to noise ( 55 ), each converter system ( 50 ) being associated with one frequency band and converting the first vector of representative parameters ( 45 ) associated with said same frequency band, and  
     the learning of said converter systems ( 50 ) being achieved on the basis of a learning corpus which corresponds to a corpus of speech contaminated by noise ( 102 ).

SUBJECT OF THE INVENTION

[0001] The present invention relates to a method and to a system for automatic speech processing.

STATE OF THE ART

[0002] Automatic speech processing comprises all the methods which analyse or generate speech by software or hardware means. At the present time, the main application fields of speech-processing methods are:

[0003] (1) speech recognition, which allows machines to “understand” human speech, and more particularly to transcribe the text which has been spoken (ASR, “Automatic Speech Recognition” systems);

[0004] (2) speaker recognition, which makes it possible to determine, within a group of persons, which person has spoken (or even to authenticate that person);

[0005] (3) language recognition (French, German, English, etc.), which makes it possible to determine the language used by a person;

[0006] (4) speech coding, whose main aim is to facilitate the transmission of a voice signal by reducing the memory size necessary for the storage of said voice signal and by reducing its bit rate;

[0007] (5) speech synthesis, which makes it possible to generate a speech signal, for example starting from a text.

[0008] In present-day speech recognition systems, the first step consists of digitising the voice signal recorded by a microphone. Next, an analysis system calculates vectors of parameters representative of this digitised voice signal. These calculations are performed at regular intervals, typically every 10 milliseconds, by analysis of short time-based signal sequences, called frames, of about 30 milliseconds of digitised signal. The analysis of the voice signal will therefore lead to a sequence of vectors of representative parameters, with one vector of representative parameters per frame. These vectors of representative parameters are then compared with reference models. This comparison generally makes use of a statistical approach based on the principle of hidden Markov models (HMMs).

[0009] These models represent basic lexical units such as phonemes, diphones, syllables or others, and may make it possible to estimate probabilities or likelihoods for these basic lexical units. These models can be considered as bricks allowing the construction of words or phrases. A lexicon defines words on the basis of these bricks, and a syntax defines the arrangements of words capable of constituting phrases. The variables defining these models are generally estimated by training on the basis of a learning corpus consisting of recorded speech signals. It is also possible to use knowledge of phonetics or linguistics to facilitate the definition of the models and the estimation of their parameters.

[0010] Different sources of variability make the recognition task difficult, for example, voice differences from one person to another, poor pronunciation, local accents, speech-recording conditions and ambient noise.

[0011] Hence, even if conventional automatic speech recognition systems generally give satisfaction under well-controlled conditions, their error rate increases substantially in the presence of noise. The higher the noise level, the greater this increase. Indeed, the presence of noise leads to distortions of the vectors of representative parameters. As these distortions are not present in the models, the performance of the system is degraded.

[0012] Numerous techniques have been developed in order to reduce the sensitivity of these systems to noise. These various techniques can be grouped into five main families, depending on the principle which they use.

[0013] Among these techniques, a first family aims to perform a processing the purpose of which is to obtain either a substantially noise-free version of a noisy signal recorded by one or several microphones, or a substantially noise-free (compensated) version of the representative parameters (J. A. Lim & A. V. Oppenheim, “Enhancement and bandwidth compression of noisy speech”, Proceedings of the IEEE, 67(12):1586-1604, December 1979). One example of embodiment using this principle is described in the document EP-0 556 992. Although very useful, these techniques nevertheless exhibit the drawback of introducing distortions in the vectors of representative parameters, and are generally insufficient to allow recognition in different acoustic environments, in particular in the case of high noise levels.

[0014] A second family of techniques relates to the obtaining of representative parameters which are intrinsically less sensitive to noise than the parameters conventionally used in the majority of automatic speech-recognition systems (H. Hermansky, N. Morgan & H. G. Hirsch, “Recognition of speech in additive and convolutional noise based on RASTA spectral processing”, in Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pages 83-86, 1993; O. Viikki, D. Bye & K. Laurila, “A recursive feature vector normalization approach for robust speech recognition in noise”, in Proc. of ICASSP'98, pages 733-736, 1998). However, these techniques exhibit certain limits related to the hypotheses on which they are based.

[0015] A third family of techniques has also been proposed. These techniques, instead of trying to transform the representative parameters, are based on the transformation of the parameters of the models used in the voice-recognition systems so as to adapt them to the actual conditions of use (A. P. Varga & R. K. Moore, “Simultaneous recognition of concurrent speech signals using hidden Markov model decomposition”, in Proc. of EUROSPEECH'91, pages 1175-1178, Genova, Italy, 1991; C. J. Leggetter & P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation”, Computer Speech and Language, 9:171-185, 1995). These adaptation techniques are in fact rapid-learning techniques which present the drawback of being effective only if the noise conditions vary slowly. Indeed, these techniques require several tens of seconds of noisy speech signal in order to adapt the parameters of the recognition models. If, after this adaptation, the noise conditions change again, the recognition system will no longer be capable of correctly associating the vectors of representative parameters of the voice signal with the models.

[0016] A fourth family of techniques consists in conducting an analysis which makes it possible to obtain representative parameters of frequency bands (H. Bourlard & S. Dupont, “A new ASR approach based on independent processing and recombination of partial frequency bands”, in Proc. of Intl. Conf. on Spoken Language Processing, pages 422-425, Philadelphia, October 1996). Models can then be developed for each of these bands; the bands together should ideally cover the entire useful frequency spectrum, in other words up to 4 or 8 kHz. The benefit of these techniques, which will be called “multi-band” techniques hereafter, is their ability to minimise, in a subsequent decision phase, the significance of heavily noise-affected frequency bands. However, these techniques are hardly efficient when the noise covers a wide range of the useful frequency spectrum. Examples of methods belonging to this family are given in the documents of Tibrewala et al. (“Sub-band based recognition of noisy speech”, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), US, Los Alamitos, IEEE Comp. Soc. Press, Apr. 21, 1997, pages 1255-1258) and of Bourlard et al. (“Subband-based speech recognition”, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), US, Los Alamitos, IEEE Comp. Soc. Press, Apr. 21, 1997, pages 1251-1254).

[0017] Finally, a fifth family of techniques consists in contaminating the whole or part of the learning corpus, by adding noise at several different noise levels, and in estimating the parameters of the models used in the ASR system on the basis of this noise-affected corpus (T. Morii & H. Hoshimi, “Noise robustness in speaker independent speech”, in Proc. of the Intl. Conf. on Spoken Language Processing, pages 1145-1148, November 1990). Examples of embodiments using this principle are described in the document EP-A-0 881 625, the document U.S. Pat. No. 5,185,848, as well as in the document of Yuk et al. (“Environment-independent continuous speech recognition using neural networks and hidden Markov models”, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), US, New York, IEEE, vol. Conf. 21, May 7, 1996, pages 3358-3361). In particular, the document of Yuk et al. proposes to use a network of artificial neurons with the purpose of transforming representative parameters, obtained by the analysis, into parameters which are noise-free or simply better adapted to the recognition system downstream. The parameters of this neuronal network are estimated on the basis of a reduced number of adaptation phrases (from 10 to 100 phrases in order to obtain good performance). The advantage of these techniques is that their performance is near optimal when the noise characterising the conditions of use is similar to the noise used to contaminate the learning corpus. On the other hand, when the two noises are different, the method is of little benefit. The scope of application of these techniques is therefore unfortunately limited, to the extent that it is not feasible to carry out contamination on the basis of diversified noise which would cover all the noises likely to be encountered during use.

[0018] The document from Hussain A. (“Non-linear sub-band processing for binaural adaptive speech-enhancement”, ICANN99, Ninth International Conference on Artificial Neural Networks (IEE Conf. Publ. No. 470), Edinburgh, UK, Sep. 7-10, 1999, pages 121-125, vol. 1) does not describe, as such, a speech-signal analysis method intended for voice recognition and/or for speech coding, but it describes a particular method of removing noise from speech, in order to obtain a noise-free time-based signal. More precisely, the method corresponds to a “multi-band” noise-removing approach, which consists of using a bank of filters producing time-based signals, said time-based signals subsequently being processed by linear or non-linear adaptive filters, that is to say filters adapting to the conditions of use. This method therefore operates on the speech signal itself, and not on vectors of representative parameters of this signal obtained by analysis. The non-linear filters used in this method are conventional artificial neuronal networks or networks using expansion functions. Recourse to adaptive filters exhibits several drawbacks. A first drawback is that the convergence of the algorithms for adapting artificial neuronal networks is slow in comparison to the modulation frequencies of certain ambient noise types, which renders them unreliable. Another drawback is that the adaptive approach, as mentioned in the document, requires a method of the “adapt-and-freeze” type, so as to adapt only during the portions of signal which are free from speech. This means making a distinction between the portions of signal with speech and the portions of signal without speech, which is difficult to implement with the currently available speech-detection algorithms, especially when the noise level is high.

AIMS OF THE INVENTION

[0019] The present invention aims to propose a method of automatic speech processing in which the error rate is substantially reduced as compared to the techniques of the state of the art.

[0020] More particularly, the present invention aims to provide a method allowing speech recognition in the presence of noise (sound coding and noise removal), whatever the nature of this noise, that is to say even if the noise features wide-band characteristics and/or even if these characteristics vary greatly in the course of time, for example if it is composed of noise containing essentially low frequencies followed by noise containing essentially high frequencies.

MAIN CHARACTERISTICS OF THE INVENTION

[0021] The present invention relates to a method of automatic processing of noise-affected speech comprising at least the following steps:

[0022] capture and digitising of the speech in the form of at least one digitised signal,

[0023] extraction of several time-based sequences or frames, corresponding to said signal, by means of an extraction system,

[0024] decomposition of each frame by means of an analysis system into at least two different frequency bands so as to obtain at least two first vectors of representative parameters for each frame, one for each frequency band, and

[0025] conversion, by means of converter systems, of the first vectors of representative parameters into second vectors of parameters relatively insensitive to noise, each converter system being associated with one frequency band and converting the first vector of representative parameters associated with said same frequency band, and the learning of said converter systems being achieved on the basis of a learning corpus which corresponds to a corpus of speech contaminated by noise.

[0026] The decomposition step into frequency bands in the method according to the present invention is fundamental in order to ensure robustness when facing different types of noise.

[0027] Preferably, the method according to the present invention further comprises a step for concatenating the second vectors of representative parameters which are relatively insensitive to noise, associated with the different frequency bands of the same frame so as to have no more than one single third vector of concatenated parameters for each frame which is then used as input in an automatic speech-recognition system.

[0028] The conversion, by the use of converter systems, can be achieved by linear transformation or by non-linear transformation.

[0029] Preferably, the converter systems are artificial neuronal networks.

[0030] The use of artificial neuronal networks, trained on the basis of noisy speech data, features the advantage of not requiring an “adaptive” approach as described in the document of Hussain A. (op. cit.) for adapting their parameters to the conditions of use.

[0031] Moreover, in contrast with the artificial neuronal networks used in the method described by Hussain A. (op. cit.), the neuronal networks as used in the present invention operate on representative vectors obtained by analysis and not directly on the speech signal itself. This analysis has the advantage of greatly reducing the redundancy present in the speech signal and of allowing representation of the signal on the basis of vectors of representative parameters of relatively restricted dimensions.

[0032] Preferably, said artificial neuronal networks are of multi-layer perceptron type and each comprises at least one hidden layer.

[0033] Advantageously, the learning of said artificial neuronal networks of the multi-layer perceptron type relies on targets corresponding to basic lexical units for each frame of the learning corpus, the output vectors of the last hidden layer or layers of said artificial neuronal networks being used as vectors of representative parameters which are relatively insensitive to the noise.

[0034] The originality of the method of automatic speech processing according to the present invention lies in the combination of two principles, the “multi-band” decomposition and the contamination by noise, which are used separately in the state of the art and, as such, offer only limited benefit, while their combination gives said method particular properties and performance which are clearly enhanced with respect to the currently available methods.

[0035] Conventionally, the techniques for contaminating the training data require a corpus correctly covering the majority of the noise situations which may arise in practice (this is called multi-style training), which is practically impossible to realise, given the diversity of the noise types. On the other hand, the method according to the invention is based on the use of a “multi-band” approach which justifies the contamination techniques.

[0036] The method according to the present invention is in fact based on the observation that, if a relatively narrow frequency band is considered, the noises will differ essentially only as to their level. Therefore, models associated with each of the frequency bands of the system can be trained after contamination of the learning corpus by any noise at different levels; these models will remain relatively insensitive to other types of noise. A subsequent decision step will then use said models, called “robust models”, for automatic speech recognition.

[0037] The present invention also relates to an automatic speech-processing system comprising at least:

[0038] an acquisition system for obtaining at least one digitised speech signal,

[0039] an extraction system, for extracting several time-based sequences or frames corresponding to said signal,

[0040] means for decomposing each frame into at least two different frequency bands in order to obtain at least two first vectors of representative parameters, one vector for each frequency band, and

[0041] several converter systems, each converter system being associated with one frequency band for converting the first vector of representative parameters associated with this same frequency band into a second vector of parameters which are relatively insensitive to the noise, and

[0042] the learning by the said converter systems being achieved on the basis of a corpus of noise-contaminated speech.

[0043] Preferably, the converter systems are artificial neuronal networks, preferably of the multi-layer perceptron type.

[0044] Preferably, the automatic speech-processing system according to the invention further comprises means allowing the concatenation of the second vectors of representative parameters which are relatively insensitive to the noise, associated with different frequency bands of the same frame in order to have no more than one single third vector of concatenated parameters for each frame, said third vector then being used as input in an automatic speech-recognition system.

[0045] It should be noted that, with the architecture of the analysis being similar for all the frequency bands, only the block diagram for one of the frequency bands is detailed here.

[0046] The automatic processing system and method according to the present invention can be used for speech recognition, for speech coding or for removing noise from speech.

BRIEF DESCRIPTION OF THE DRAWINGS

[0047] FIG. 1 presents a diagram of the first steps of automatic speech processing according to a preferred embodiment of the present invention, going from the acquisition of the speech signal up to the obtaining of the representative parameters which are relatively insensitive to the noise associated with each of the frequency bands.

[0048] FIG. 2 presents the principle of contamination of the learning corpus by noise, according to one preferred embodiment of the present invention.

[0049] FIG. 3 presents a diagram of the automatic speech-processing steps which follow the steps of FIG. 1 according to one preferred embodiment of the present invention, for a speech-recognition application, and going from the concatenation of the noise-insensitive representative parameters associated with each of the frequency bands to the recognition decision.

[0050] FIG. 4 presents the automatic speech-processing steps which follow the steps of FIG. 1 according to one preferred embodiment of the invention, and which are common to coding, noise-removal and speech-recognition applications.

DESCRIPTION OF A PREFERRED EMBODIMENT OF THE INVENTION

[0051] According to one preferred embodiment of the invention, as FIG. 1 shows, the signal 1, sampled at a frequency of 8 kHz, first passes through a windowing module 10 constituting an extraction system which divides the signal into a succession of time-based 15- to 30-ms frames (240 samples). Two successive frames overlap by 20 ms. The elements of each frame are weighted by a Hamming window.
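By way of illustration only, the framing performed by the windowing module 10 may be sketched as follows. The sketch assumes 240-sample frames (30 ms at 8 kHz) shifted by 80 samples, which corresponds to the 20 ms overlap mentioned above. The function names, the test tone and the choice of discarding an incomplete trailing frame are assumptions of this sketch and not features of the described embodiment.

```python
import math

def hamming(n):
    """Hamming window of length n (standard 0.54 / 0.46 coefficients)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def extract_frames(signal, frame_len=240, hop=80):
    """Cut `signal` into overlapping frames and weight each by a Hamming window.

    frame_len=240 samples is 30 ms at 8 kHz; hop=80 samples is a 10 ms shift,
    i.e. a 20 ms overlap between successive frames. Trailing samples that do
    not fill a whole frame are dropped (an assumption of this sketch).
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames

# One second of a hypothetical 440 Hz test tone sampled at 8 kHz.
signal = [math.sin(2.0 * math.pi * 440.0 * t / 8000.0) for t in range(8000)]
frames = extract_frames(signal)
```

With these values, one frame is produced every 10 ms, in keeping with the analysis rate mentioned in the state of the art.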

[0052] Next, in a first digital-processing step, a critical-band analysis is performed on each sampled-signal frame by means of a module 20. This analysis is representative of the frequency resolution scale of the human ear. The approach used is inspired by the first analysis phase of the PLP technique (H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech”, the Journal of the Acoustical Society of America, 87(4):1738-1752, April 1990). It operates in the frequency domain. The filters used are trapezoidal and the distance between the central frequencies follows a psychoacoustic frequency scale. The distance between the central frequencies of two successive filters is set at 0.5 Bark in this case, the Bark frequency (B) being able to be obtained by the expression:

B = 6 \ln\left( \frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^{2} + 1} \right)

[0053] where (f) is the frequency in Hertz.

[0054] Other values could nevertheless be envisaged.
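As a purely illustrative sketch, the Bark expression given above and its inverse may be written as follows. The inverse relation f = 600 sinh(B/6) follows directly from the expression, since the logarithmic term is the inverse hyperbolic sine of f/600. The function names and the position of the first centre frequency are assumptions of this sketch; the description does not state where the first filter is placed.

```python
import math

def hz_to_bark(f_hz):
    """Bark value B = 6 ln(f/600 + sqrt((f/600)^2 + 1)), i.e. 6 * asinh(f/600)."""
    x = f_hz / 600.0
    return 6.0 * math.log(x + math.sqrt(x * x + 1.0))

def bark_to_hz(bark):
    """Inverse mapping, f = 600 * sinh(B / 6)."""
    return 600.0 * math.sinh(bark / 6.0)

def centre_frequencies(f_max_hz, spacing_bark=0.5):
    """Centre frequencies in Hz spaced every `spacing_bark` up to f_max_hz.

    Starting the grid one spacing step above 0 Bark is an assumption of this
    sketch; the exact filter count depends on that convention.
    """
    centres = []
    b = spacing_bark
    while b <= hz_to_bark(f_max_hz):
        centres.append(bark_to_hz(b))
        b += spacing_bark
    return centres

# Centre frequencies up to the 4 kHz Nyquist limit of an 8 kHz signal.
centres = centre_frequencies(4000.0)
```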

[0055] For a signal sampled at 8 kHz, this analysis leads to a vector 25 comprising the energies of 30 frequency bands. The procedure also includes an accentuation of the high frequencies.

[0056] This vector of 30 elements is then split into seven sub-vectors of representative parameters of the spectral envelope in seven different frequency bands. The following decomposition is used: 1-4 (the filters indexed from 1 to 4 constitute the first frequency band), 5-8, 9-12, 13-16, 17-20, 21-24 and 25-30 (the frequencies covered by these seven bands are given in FIG. 1).

[0057] Each sub-vector is normalised by dividing the values of its elements by the sum of all the elements of the sub-vector, that is to say by an estimate of the energy of the signal in the frequency band in question. This normalisation makes the sub-vector insensitive to the energy level of the signal.

[0058] For each frequency band, the representative parameters finally consist of the normalised sub-vector corresponding to the band, as well as the estimate of the energy of the signal in this band.
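The decomposition into seven bands and the normalisation described above may, for example, be sketched as follows. The band limits are those given in the description; the function name, the placement of the energy estimate as the last element of each vector and the small floor protecting against division by zero are assumptions of this sketch.

```python
# Filter index ranges (1-based, inclusive) of the seven bands given in the text.
BANDS = [(1, 4), (5, 8), (9, 12), (13, 16), (17, 20), (21, 24), (25, 30)]

def band_parameters(energies):
    """Split a 30-element critical-band energy vector into per-band parameters.

    For each band the result is the sub-vector divided by its sum, followed by
    that sum (the band energy estimate). Guarding against a zero sum with a
    small floor is an assumption of this sketch.
    """
    if len(energies) != 30:
        raise ValueError("expected 30 critical-band energies")
    params = []
    for first, last in BANDS:
        sub = energies[first - 1:last]
        energy = sum(sub)
        norm = max(energy, 1e-12)
        params.append([e / norm for e in sub] + [energy])
    return params

# Hypothetical energy vector 1.0, 2.0, ..., 30.0 used only as an example.
vectors = band_parameters([float(i) for i in range(1, 31)])
```

Under these assumptions, the first six bands each yield a vector of five parameters and the seventh band a vector of seven parameters.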

[0059] For each of the seven frequency bands, the processing described above is performed by a module 40 which supplies a vector 45 of representative parameters of the band in question. The module 40, together with the module 20, defines a system called the analysis system.

[0060] The modules 10, 20 and 40 could be replaced by any other approach making it possible to obtain representative parameters of different frequency bands.

[0061] For each frequency band, the corresponding representative parameters are then used by a converter system 50, the purpose of which is to estimate a vector 55 of representative parameters which are relatively insensitive to the noise present in the sampled speech signal.

[0062] As FIG. 3 shows, the vectors of representative parameters which are insensitive to the noise associated with each of the frequency bands are then concatenated in order to constitute a larger vector 56.

[0063] This large vector 56 is finally used as a vector of representative parameters of the frame in question. It can be used by the module 60, which corresponds to a speech-recognition system whose purpose is to supply the sequence of speech units which have been spoken.

[0064] In order to realise the desired functionality, an artificial neuronal network (ANN) (B. D. Ripley, “Pattern Recognition and Neural Networks”, Cambridge University Press, 1996) has been used as the implementation of the converter system 50. In general, the ANN calculates vectors of representative parameters according to an approach similar to that of non-linear discriminant analysis (V. Fontaine, C. Ris & J. M. Boite, “Non-linear discriminant analysis for improved speech recognition”, in Proc. of EUROSPEECH'97, Rhodes, Greece, 1997). Nevertheless, other linear-transformation or non-linear-transformation approaches, not necessarily involving an ANN, could equally be suitable for calculating the vectors of representative parameters, such as for example linear-discriminant-analysis techniques (K. Fukunaga, “Introduction to Statistical Pattern Recognition”, Academic Press, 1990), principal-component-analysis techniques (I. T. Jolliffe, “Principal Component Analysis”, Springer-Verlag, 1986) or regression techniques allowing the estimation of a noise-free version of the representative parameters (H. Sorensen, “A cepstral noise reduction multi-layer neural network”, Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2, p. 933-936, 1991).

[0065] More precisely, the neuronal network used here is a multi-layer perceptron comprising two layers of hidden neurons. The non-linear functions of the neurons of this perceptron are sigmoids. The ANN comprises one output per basic lexical unit.

[0066] This artificial neuronal network is trained by the back-propagation algorithm on the basis of a criterion of minimising the relative entropy. The training or learning is supervised and relies on targets corresponding to the basic lexical units of the presented training examples. More precisely, for each training or learning frame, the desired output of the ANN corresponding to the basic lexical unit of that frame is set to 1, the other desired outputs being set to zero.
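The construction of the targets and the relative-entropy criterion may be illustrated by the following sketch. It assumes that the ANN outputs are treated as a probability distribution over the basic lexical units; the function names and the numerical floor are assumptions of this sketch. With targets of this form, the relative entropy reduces to the negative logarithm of the output associated with the target unit.

```python
import math

def one_hot(unit_index, n_units):
    """Target vector: 1 for the lexical unit of the frame, 0 for the others."""
    return [1.0 if k == unit_index else 0.0 for k in range(n_units)]

def relative_entropy(targets, outputs, eps=1e-12):
    """Relative entropy sum_k t_k * log(t_k / y_k).

    Terms with t_k = 0 contribute nothing; `eps` floors the outputs to avoid
    taking the logarithm of zero (an assumption of this sketch).
    """
    total = 0.0
    for t, y in zip(targets, outputs):
        if t > 0.0:
            total += t * math.log(t / max(y, eps))
    return total

# Hypothetical example with three lexical units, the second being the target.
targets = one_hot(1, 3)
loss = relative_entropy(targets, [0.1, 0.8, 0.1])
```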

[0067] In the present case, the basic lexical units are phonemes. However, it is equally possible to use other types of units, such as allophones (phonemes in a particular phonetic context) or phonetic traits (nasalisation, frication).

[0068] As illustrated in FIG. 2, the parameters of this ANN are estimated on the basis of a learning corpus 101 contaminated by noise 102 by means of module 100. So as to cover a majority of the noise levels likely to be encountered in practice, six versions of the learning corpus are used here.

[0069] One of the versions is used as it is, that is to say without added noise. The other versions have noise added by the use of the module 100 at different signal/noise ratios: 0 dB, 5 dB, 10 dB, 15 dB and 20 dB. These six versions are used to train the ANN. These training data are used at the input to the system presented in FIG. 1.
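One possible way of producing the six versions of an utterance is sketched below for illustration. The sketch assumes that the signal/noise ratio is defined as the ratio of the average powers of speech and noise over the whole utterance, and that the noise excerpt has the same length as the speech; the function names and the example signals are assumptions of this sketch and the description does not specify how module 100 measures the ratio.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Return speech + g * noise, with the gain g chosen so that the ratio of
    average speech power to average scaled-noise power equals snr_db."""
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]

def contaminated_versions(speech, noise, snrs_db=(0, 5, 10, 15, 20)):
    """The clean utterance plus one noisy copy per signal/noise ratio."""
    return [list(speech)] + [mix_at_snr(speech, noise, snr) for snr in snrs_db]

# Hypothetical stand-ins for a speech utterance and a noise recording.
rng = random.Random(0)
speech = [math.sin(0.1 * t) for t in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
versions = contaminated_versions(speech, noise)
```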

[0070] This system makes it possible to obtain representative parameters 45 of the various frequency bands envisaged. It is these parameters which feed the artificial neuronal networks and especially allow training by back-propagation (B. D. Ripley, “Pattern Recognition and Neural Networks”, Cambridge University Press, 1996).

[0071] It should be noted that all the techniques which are generally employed when neuronal networks are used in speech processing can be applied here. Hence, it has been chosen here to apply, as input of the ANN, the vectors of representative parameters of several successive signal frames, more precisely 9, so as to model the time-based correlation of the speech signal.
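The presentation of 9 successive vectors at the input of the ANN may be sketched as follows. Centring the window on the frame being processed and repeating the first or last vector at the edges of the utterance are assumptions of this sketch; the description only states that the vectors of 9 successive frames are applied as input.

```python
def stack_context(vectors, context=9):
    """For each frame, concatenate the `context` parameter vectors centred on it.

    Centring the window and repeating the first/last vector at the edges are
    assumptions of this sketch.
    """
    half = context // 2
    last = len(vectors) - 1
    stacked = []
    for t in range(len(vectors)):
        window = []
        for offset in range(-half, half + 1):
            idx = min(max(t + offset, 0), last)
            window.extend(vectors[idx])
        stacked.append(window)
    return stacked

# Hypothetical sequence of 20 frames with 2 parameters each.
vectors = [[float(t), float(t) + 0.5] for t in range(20)]
stacked = stack_context(vectors)
```

For a band described by five parameters per frame, such a window would lead to an ANN input of 45 values.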

[0072] When an ANN is being used, an approach similar to that of the non-linear discriminant analysis is employed. The outputs of the second hidden layer, 30 in number, are used as parameters 55 which are insensitive to the noise for the associated frequency band.
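A minimal sketch of such a converter, given for illustration only, is shown below: a multi-layer perceptron with two hidden layers of sigmoid units, from which the 30 outputs of the second hidden layer are read as the parameters 55. The class and function names, the size of the first hidden layer, the number of lexical units and the random weights (which stand in for trained ones) are assumptions of this sketch.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """Fully connected layer with sigmoid units."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def random_layer(n_in, n_out, rng):
    """Small random weights standing in for trained ones (illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

class BandConverter:
    """Hypothetical converter for one frequency band."""

    def __init__(self, n_in, n_hidden1, n_units, n_hidden2=30, seed=0):
        rng = random.Random(seed)
        self.h1 = random_layer(n_in, n_hidden1, rng)
        self.h2 = random_layer(n_hidden1, n_hidden2, rng)
        self.out = random_layer(n_hidden2, n_units, rng)

    def robust_parameters(self, x):
        """Outputs of the second hidden layer, used as the parameters 55."""
        return layer(layer(x, *self.h1), *self.h2)

    def unit_outputs(self, x):
        """One output per basic lexical unit, needed only for training."""
        return layer(self.robust_parameters(x), *self.out)

# 45 inputs (9 frames of 5 parameters); 100 hidden units and 40 lexical
# units are arbitrary values chosen for this example.
converter = BandConverter(n_in=45, n_hidden1=100, n_units=40)
robust = converter.robust_parameters([0.2] * 45)
```

Once training is complete, only the part of the network up to the second hidden layer is needed to compute the parameters 55; the output layer serves the supervised training.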

[0073] As FIG. 3 shows, in a first application, the vectors of parameters associated with each of the seven frequency bands are then concatenated so as to lead to a vector 56 of 210 concatenated parameters.
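The dimension of the vector 56 follows from the figures given above: seven frequency bands, each contributing 30 noise-insensitive parameters, give 7 x 30 = 210 parameters per frame. A short sketch of the concatenation, with a hypothetical function name and placeholder values, is:

```python
def concatenate_bands(band_vectors):
    """Join the per-band parameter vectors of one frame, in band order."""
    joined = []
    for vector in band_vectors:
        joined.extend(vector)
    return joined

# Seven placeholder vectors of 30 parameters, one per frequency band.
band_vectors = [[float(band)] * 30 for band in range(7)]
vector_56 = concatenate_bands(band_vectors)
```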

[0074] At each signal frame, this vector is then used as input for an automatic speech-recognition system 60. This system is trained on the basis of representative parameters calculated by the technique described above (system illustrated in FIG. 1) on the basis of a corpus of speech (noise-affected or otherwise) in keeping with the desired recognition task.

[0075] It should be noted that the corpus of data allowing development of the systems 50 associated with each frequency band is not necessarily the same as that serving for the training of the voice-recognition system 60.

[0076] All types of robust techniques of the state of the art may play a part freely in the context of the system proposed here, as FIG. 1 illustrates.

[0077] Hence, robust acquisition techniques, especially those based on arrays of microphones 2, may be of use in obtaining a relatively noise-free speech signal.

[0078] Likewise, the noise-removal techniques such as spectral subtraction 3 (M. Berouti, R. Schwartz & J. Makhoul, “Enhancement of speech corrupted by acoustic noise”, in Proc. of ICASSP'79, pages 208-211, April 1979) can be envisaged.

[0079] Any technique 22 for calculation of intrinsically robust parameters or any technique 21 for compensation of the representative parameters can likewise be used.

[0080] Thus, the modules 10, 20 and 40 can be replaced by any other technique making it possible to obtain representative parameters of different frequency bands.

[0081] The more insensitive these parameters are to ambient noise, the better the overall system will behave.

[0082] In the context of the application to voice recognition, as FIG. 2 shows, techniques 61 for adaptation of the models may likewise be used.

[0083] A procedure 62 for training the system on the basis of a corpus of speech contaminated by noise is likewise possible.

[0084] In a second application, as FIG. 4 shows, the “robust” parameters 55 are used as input for a regression module 70 allowing the estimation of conventional representative parameters 75 which can be used in the context of speech-processing techniques. For a speech-coding or noise-removal task, this system 70 could estimate the parameters of an autoregressive model of the speech signal. For a voice-recognition task, it is preferable to estimate cepstra, that is to say the values of the inverse discrete Fourier transform of the logarithm of the magnitude of the discrete Fourier transform of the signal.
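To make the definition of cepstra concrete, the following sketch computes them directly from a signal frame. It illustrates only the quantity which the regression module 70 would be asked to estimate, not the regression module itself. The function names, the direct (non-fast) transform and the floor applied before taking the logarithm are assumptions of this sketch.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct O(N^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log magnitude of the DFT of `frame`."""
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    return [c.real for c in dft(log_mag, inverse=True)]

# A unit impulse has a flat spectrum of magnitude 1, hence an all-zero cepstrum.
impulse = [1.0] + [0.0] * 15
cepstrum = real_cepstrum(impulse)
```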

[0085] The regression model is optimised in the conventional way on the basis of a corpus of speech, noise-affected or otherwise.

[0086] The ideal outputs of the regression module are calculated on the basis of non-noise-affected data.

[0087] All the operations described above are performed by software modules running on a single microprocessor. Furthermore, any other approach can be used freely.

[0088] It is possible, for example, to envisage a distributed processing in which the voice-recognition module 60 runs on a nearby or remote server to which the representative parameters 55 are supplied by way of a data-processing or telephony network.

1. A method of automatic processing of noise-affected speech comprising at least the following steps: capture and digitising of the speech in the form of at least one digitised signal (1), extraction of several time-based sequences or frames (15), corresponding to said signal, by means of an extraction system (10), decomposition of each frame (15) by means of an analysis system (20, 40) into at least two different frequency bands so as to obtain at least two first vectors of representative parameters (45) for each frame (15), one for each frequency band, and conversion, by means of converter systems (50), of the first vectors of representative parameters (45) into second vectors of parameters relatively insensitive to noise (55), each converter system (50) being associated with one frequency band and converting the first vector of representative parameters (45) associated with said same frequency band, and the learning of said converter systems (50) being achieved on the basis of a learning corpus which corresponds to a corpus of speech contaminated by noise (102).
2. The method according to claim 1, characterised in that it further comprises a step of concatenation of the second vectors of representative parameters which are relatively insensitive to noise (55), associated with the different frequency bands of the same frame (15) so as to have no more than one single third vector of concatenated parameters (56) for each frame (15) which is then used as input in an automatic speech-recognition system (60).
3. The method according to claim 1 or 2, characterised in that the conversion, by means of converter systems (50), is achieved by linear transformation or by non-linear transformation.
4. The method according to one of claims 1 to 3, characterised in that the converter systems (50) are artificial neuronal networks.
5. The method according to claim 4, characterised in that the said artificial neuronal networks are of the multi-layer perceptron type and each comprises at least one hidden layer.
6. The method according to claim 5, characterised in that the learning by the said artificial neuronal networks of the multi-layer perceptron type relies on targets corresponding to the basic lexical units for each frame of the learning corpus, the output vectors of the last hidden layer or layers of the said artificial neuronal networks being used as vectors of representative parameters which are relatively insensitive to the noise.
7. An automatic speech-processing system comprising at least: an acquisition system for obtaining at least one digitised speech signal (1), an extraction system (10), for extracting several time-based sequences or frames (15) corresponding to said signal (1), means (20, 40) for decomposing each frame (15) into at least two different frequency bands so as to obtain at least two first vectors of representative parameters (45), one vector for each frequency band, and several converter systems (50), each converter system (50) being associated with one frequency band and making it possible to convert the first vector of representative parameters (45) associated with this same frequency band into a second vector of parameters which are relatively insensitive to noise (55), and the learning by the said converter systems (50) being achieved on the basis of a corpus of speech corrupted by noise (102).
8. The automatic speech-processing system according to claim 7, characterised in that the converter systems (50) are artificial neuronal networks, preferably of the multi-layer perceptron type.
9. The automatic speech-processing system according to claim 7 or claim 8, characterised in that it further comprises means allowing the concatenation of the second vectors of representative parameters which are relatively insensitive to noise (55), associated with different frequency bands of the same frame (15) so as to have no more than one single third vector of concatenated parameters (56) for each frame (15), said third vector then being used as input into an automatic speech-recognition system (60).
10. Use of the method according to one of claims 1 to 6 and/or of the system according to one of claims 7 to 9 for speech recognition.
11. Use of the method according to one of claims 1, 3 to 6 and/or of the system according to one of claims 7 or 8 for speech coding.
12. Use of the method according to one of claims 1, 3 to 6 and/or of the system according to one of claims 7 or 8 for removing noise from speech.